Text File | 1985-02-06 | 9KB | 150 lines
BREAK DOWN
A text analysis and generation program
written in TURBO Pascal
copyright 1985 by Neil J. Rubenking
requires DOS 2.0, 96K
I. AN ANALOGY
There is a fairly commonly believed "fact" that if you
set millions of monkeys typing at random on millions of
typewriters for millions of years, they would eventually crank
out all the world's great literature. Deep down inside, BREAK
DOWN is just one of those monkeys, but this monkey is educated.
BREAK DOWN "reads" a text and creates a frequency table that
gives it some method in its madness. The output can be
surprisingly similar to the input.
II. WHAT DOES IT DO?
To analyze a text, BREAK DOWN looks at it in chunks of
a particular size (one less than the "order") and keeps a record
of what characters occur immediately after that pattern. If
the chunk is new to the frequency table, it is added to the
table. Its frequency array starts as all zeros, except for a
count of one for the current next character. If the chunk already
exists in the table, the count for the current next character is
incremented by one. Then the "chunk" is shifted one character
to the right and the process goes on -- that is, the chunk's
first character is dropped and the current next character
is tacked onto the end.
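
The analysis loop described above can be sketched in a few lines
of Python. This is only an illustration, not the TURBO Pascal
source: the function name "analyze" is made up here, and an
in-memory dictionary stands in for the disk-based frequency table.

```python
from collections import defaultdict

def analyze(text, order):
    """Map each chunk of (order - 1) characters to counts of the
    characters seen immediately after it."""
    size = order - 1
    table = defaultdict(lambda: defaultdict(int))
    if len(text) <= size:
        return table          # too short to contain a chunk plus a next character
    chunk = text[:size]
    for next_char in text[size:]:
        # A new chunk starts at zero for every character, so the first
        # sighting leaves a count of one for the current next character.
        table[chunk][next_char] += 1
        # Shift right: tack the next character on, drop the first one.
        chunk = (chunk + next_char)[1:]
    return table

table = analyze("the theme then", 3)
print(dict(table["th"]))      # {'e': 3}
```

With an order of 3 the chunks are two characters long, so "th" is
recorded three times, each time followed by "e".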
III. WHAT MAKES IT SPECIAL?
Checking to see if the chunk is present would be quite a
task if the frequency table were stored sequentially.
Fortunately, BORLAND International sells a product called TURBO TOOLBOX
that implements fast indexed storage, using the B-Tree system.
(For a discussion of B-Trees and TURBO TOOLBOX, see the February
1985 PC Tech Journal.) The chunks of text are stored in the
B-Tree index file -- they are the KEYS. The data file contains
only the frequency arrays. Ordinarily, the KEYS would also be
stored in the data file, but this redundancy is not strictly
necessary. Since the data file can conceivably contain one
record for every BYTE in the source, we want to keep the record
size to a minimum. (This "worst case" would occur if NO pattern
in the text occurred more than once.)
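
The split between keys and data can be pictured with the small
Python sketch below. It is a model of the idea only: a plain
dictionary stands in for the B-Tree index file, a list stands in
for the data file, and the class name "FrequencyStore" and the
saturating one-byte counter are assumptions, not details taken
from the program.

```python
# The 34 tracked characters; "\x14" is ASCII character 20.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,-?#'\x14"

class FrequencyStore:
    def __init__(self):
        self.index = {}    # chunk -> record number (stand-in for the index file)
        self.records = []  # record number -> count bytes (stand-in for the data file)

    def add(self, chunk, next_char):
        rec_no = self.index.get(chunk)
        if rec_no is None:
            # New chunk: the key goes into the index only; the data
            # record holds nothing but the frequency array.
            rec_no = len(self.records)
            self.index[chunk] = rec_no
            self.records.append(bytearray(len(ALPHABET)))
        slot = ALPHABET.index(next_char)
        record = self.records[rec_no]
        if record[slot] < 255:   # assumption: a one-byte count stops at 255
            record[slot] += 1

store = FrequencyStore()
for chunk, next_char in [("th", "e"), ("he", " "), ("th", "e")]:
    store.add(chunk, next_char)
print(store.index)               # {'th': 0, 'he': 1}
```

Each data record is 34 bytes no matter how long the key is, which
is the point of keeping the keys out of the data file.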
IV. WHAT ARE THE INPUT LIMITS?
At present, BREAK DOWN tracks 34 characters. These are
the 26 lower-case alphabetic characters, the space, period,
comma, dash, question mark, number symbol and single quote, and
ASCII character 20, the paragraph symbol. If a line is shorter
than the constant "LineWidth" (currently set to 55), it is
considered to have ended "early" with a hard Carriage Return, and
is marked with the paragraph symbol at the end. In the
preprocessing phase (procedure "CleanUp"), all letters are converted
to lower-case, each number is reduced to a single # symbol,
double quotes become single quotes, and all unused punctuation is
removed. It would, of course, be possible to track more
characters, but each character adds a byte to every record.
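
A rough Python equivalent of the preprocessing rules is sketched
below. It is not the "CleanUp" procedure itself. Two details are
guesses: that a run of digits counts as one number, and that a
full-width line is joined to the next with a space.

```python
import re

LINE_WIDTH = 55
PARAGRAPH = "\x14"   # ASCII character 20
KEPT = set("abcdefghijklmnopqrstuvwxyz .,-?#'")

def clean_up(lines):
    """Reduce raw text lines to the tracked character set."""
    out = []
    for raw in lines:
        raw = raw.rstrip("\r\n")
        text = raw.lower().replace('"', "'")
        text = re.sub(r"[0-9]+", "#", text)   # assumption: a digit run is one number
        text = "".join(ch for ch in text if ch in KEPT)
        # A line shorter than LINE_WIDTH ended "early", so mark it.
        # Assumption: any other line break becomes a space.
        out.append(text + (PARAGRAPH if len(raw) < LINE_WIDTH else " "))
    return "".join(out)

print(repr(clean_up(['He said "Hi" 42 times!'])))
# "he said 'hi' # times\x14"
```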
V. HOW DOES IT WORK?
To generate a new text, BREAK DOWN selects at random a
KEY that begins with a space (i.e., one that doesn't start in
the middle of a word). It then looks up the frequency array for
that KEY and selects the next character at random from the
characters with non-zero frequency, weighted by the frequencies.
This character is added to the current output line, and to the
current KEY chunk. If the paragraph symbol is encountered, the
line is automatically ended. Also, the current line ends at the
first space encountered after its length surpasses the LineWidth
constant. The first alphabetic character after a period, question
mark, or line end is capitalized.
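
The generation rules above can be sketched in Python as follows.
This is an illustration under assumptions, not the program's own
code: the table is a dictionary of dictionaries, the function
name "generate" is invented, the seed KEY itself is not printed,
and the very first letter is capitalized as if it followed a line
end.

```python
import random

LINE_WIDTH = 55
PARAGRAPH = "\x14"   # ASCII character 20

def generate(table, length, rng=random):
    """Return output lines built from `length` generated characters."""
    starts = [key for key in table if key.startswith(" ")]
    if not starts:
        raise ValueError("no key begins with a space")
    chunk = rng.choice(starts)
    lines, line, capitalize = [], "", True
    for _ in range(length):
        counts = table.get(chunk)
        if not counts:
            break                  # chunk never seen with a follower
        chars = list(counts)
        # Weighted pick among the characters with non-zero frequency.
        ch = rng.choices(chars, weights=[counts[c] for c in chars])[0]
        chunk = (chunk + ch)[1:]
        if ch == PARAGRAPH or (ch == " " and len(line) > LINE_WIDTH):
            lines.append(line)     # end the current line
            line, capitalize = "", True
            continue
        if capitalize and ch.isalpha():
            ch, capitalize = ch.upper(), False
        elif ch in ".?":
            capitalize = True
        line += ch
    if line:
        lines.append(line)
    return lines

# A tiny hand-made table in which every choice is forced.
table = {" a": {"b": 1}, "ab": {".": 1}, "b.": {" ": 1}, ". ": {"a": 1}}
print(generate(table, 8))          # ['B. Ab. A']
```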
VI. HOW SHOULD I SET IT UP?
BREAK DOWN is a prime candidate for RAMDisk operation. The
B-Tree file access limits the number of disk accesses quite
a bit, but there are still several accesses for each BYTE in
the source file. The data file of a text under 10K in length
will definitely fit on one floppy, but the data file of an 11K
text could conceivably overflow it. You are expected to be sure
you have enough space. You may distribute your files to various
disk drives -- a likely arrangement is the .DAT file on drive B
and the source, .INX file, and BREAK DOWN program on drive A.
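
The 10K and 11K figures are consistent with a simple worst-case
estimate, worked out below in Python. The estimate rests on
assumptions: one record per source byte (the "worst case" from
section III), 34 bytes per record (one per tracked character), a
360K floppy, and no file overhead. None of these numbers is
stated outright in this document.

```python
RECORD_BYTES = 34            # assumption: one count byte per tracked character
FLOPPY_BYTES = 360 * 1024    # assumption: a 360K diskette

def worst_case_data_bytes(source_bytes):
    """Data file size if no pattern in the source ever repeats."""
    return source_bytes * RECORD_BYTES

for kb in (10, 11):
    size = worst_case_data_bytes(kb * 1024)
    print(kb, size, size <= FLOPPY_BYTES)
# 10 348160 True
# 11 382976 False
```

Real texts repeat patterns, so the actual data file is normally
smaller than this bound.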
VII. TIPS FOR USE
The higher the order, the more intelligible the output
will be. However, a high order and a short text will mostly
just regenerate the original. Experiment with various texts
and various orders. You can use the [L]ist option to see just
what sort of records are being generated. BREAK DOWN has been
tried on a 100K text file, with the Order set to the maximum of
8. It took over 6 hours and generated a 1.8 megabyte data file,
but it worked.
VIII. DEFAULT OPTIONS
BREAK DOWN will prompt you for the Order each time you make
a selection from the menu. After you have entered an Order, you
can just hit <return> for the previous value. The MAIN file name
works the same way -- after the first time you enter it, hitting
<return> will recall the same name. The default for the .DAT
and .INX drives, and the output file, if any, is the same as the
source file. Thus, if you have [A]nalyzed a text with all its
files on one drive, you can fill in the blanks for [G]enerating a
travesty by repeatedly hitting <return>.
IX. ADVANCED (?) USE
The [M]erge option lets you combine two data files, possibly
from wildly different sources. The data and index files of
the "source" will be permanently changed, so you may want to keep
a copy. You can also "read in" another text into an existing
data file. Use this option to build up a frequency table
"model" of a particular author, or to make bizarre hybrids.
X. WHAT'S IT GOOD FOR?
BREAK DOWN is a moderately sophisticated database program
with almost no "serious" uses. However, there are all kinds of
non-serious uses for it. Read in three or four "letters from
camp" and then let the PC generate more. Generate new speeches
based on Our President's proclamations. Find out what Lewis
Carroll would have written had he been a Zen Master. Or examine
the program itself to see how the TURBO TOOLBOX can be used to
manage other sorts of data.
XI. CREDITS
This program was inspired by the TRAVESTY program in
the November 1984 BYTE magazine, by Kenner and O'Rourke. They
in turn were inspired by an article in the Scientific American of
November 1983 by Brian Hayes. Both these articles make good
reading. TURBO TOOLBOX and TURBO Pascal are available from
BORLAND International, 4113 Scotts Valley Road, Scotts Valley, CA
95066. BREAK DOWN itself was written by Neil J. Rubenking.
This program may be freely used and copied, but I retain the sole
right to SELL it. Users Groups and Software Clubs may charge a
reasonable price for the disk and copying/handling charges.